Taxi Exploration by Wilfried Hoge

The taxi data used in this exploration was downloaded from http://www.andresmh.com/nyctaxitrips/ and joined with weather data from https://weatherspark.com/. The data set is very large, so only a small sample is used here. The sample contains 1% of the data for 4 months (January to April 2013).


General Statistics

# show the structure of the taxi data
str(taxi)
## 'data.frame':    60003 obs. of  28 variables:
##  $ Medallion        : Factor w/ 12934 levels "00005007A9F30E289E760362F69E4EAD",..: 8407 12562 11442 1754 2919 6413 6237 7548 6036 8866 ...
##  $ Vendor           : Factor w/ 2 levels "CMT","VTS": 2 2 2 2 2 2 2 2 2 2 ...
##  $ pickup.datetime  : chr  "2013-04-15 20:30:00.0" "2013-04-15 23:55:00.0" "2013-04-16 07:59:00.0" "2013-04-04 13:41:00.0" ...
##  $ dropoff.datetime : chr  "2013-04-15 20:39:00.0" "2013-04-15 23:59:00.0" "2013-04-16 08:14:00.0" "2013-04-04 13:46:00.0" ...
##  $ passenger.count  : Factor w/ 7 levels "0","1","2","3",..: 2 3 4 2 2 2 2 2 6 2 ...
##  $ trip.time        : int  540 240 900 300 1380 480 600 480 600 300 ...
##  $ trip.distance    : num  1.63 1.33 2.86 0.57 9.58 0.77 2.63 2.69 2.35 1.28 ...
##  $ pickup.longitude : num  -74 -73.9 -74 -74 -74 ...
##  $ pickup.latitude  : num  40.8 40.8 40.8 40.8 40.8 ...
##  $ dropoff.longitude: num  -74 -74 -74 -74 -73.9 ...
##  $ dropoff.latitude : num  40.8 40.8 40.8 40.7 40.8 ...
##  $ payment.type     : Factor w/ 5 levels "CRD","CSH","DIS",..: 2 2 1 2 1 2 1 1 2 2 ...
##  $ fare.amount      : num  8.5 6 12.5 5 29 7 10.5 9.5 10 6.5 ...
##  $ Surcharge        : num  0.5 0.5 0 0 0 0 0 0.5 0 0 ...
##  $ mta.tax          : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ tip.amount       : num  0 0 1 0 5 0 3 2 0 0 ...
##  $ tolls.amount     : num  0 0 0 0 5.33 0 0 0 0 0 ...
##  $ total.amount     : num  9.5 7 14.5 5.5 40.6 ...
##  $ Year             : Factor w/ 1 level "2013": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Month            : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ Day              : Factor w/ 31 levels "1","10","11",..: 7 7 8 26 26 22 16 15 22 16 ...
##  $ Hour             : Factor w/ 24 levels "0","1","2","3",..: 21 24 8 14 15 8 8 24 8 7 ...
##  $ Temperature      : num  9.4 8.9 9.4 8.3 9.4 11.7 5 7.8 11.7 5 ...
##  $ Precipitation    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pickup.date      : Date, format: "2013-04-15" "2013-04-15" ...
##  $ weekday          : Factor w/ 7 levels "Monday","Tuesday",..: 1 1 2 4 4 1 2 1 1 2 ...
##  $ weekend          : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ rain             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# show the factors
levels(taxi$Vendor)
## [1] "CMT" "VTS"
levels(taxi$passenger.count)
## [1] "0" "1" "2" "3" "4" "5" "6"
levels(taxi$Hour)
##  [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13"
## [15] "14" "15" "16" "17" "18" "19" "20" "21" "22" "23"
levels(taxi$weekday)
## [1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"    "Saturday" 
## [7] "Sunday"
levels(taxi$weekend)
## [1] "no"  "yes"
levels(taxi$rain)
## [1] "no"  "yes"
# remove NAs from the data set
taxi = na.omit(taxi)

# a summary of the data
summary(taxi)
##                             Medallion     Vendor      pickup.datetime   
##  6DFD37A4BDC448C365B36465D73A4CCE:   18   CMT:30506   Length:59894      
##  5E9BC99D16CFF51F5BB3361660713D1D:   16   VTS:29388   Class :character  
##  630D390996EED8370C1E5494B9EFED1F:   16               Mode  :character  
##  6F9EC82D4E5C8B03A93FC1C5DAB16465:   16                                 
##  7A244CCB309CDA8071892F11902ADC5C:   16                                 
##  0FF5BCE95C86107079CED655BD5D5BCA:   15                                 
##  (Other)                         :59797                                 
##  dropoff.datetime   passenger.count   trip.time      trip.distance   
##  Length:59894       0:    0         Min.   :   0.0   Min.   : 0.000  
##  Class :character   1:42365         1st Qu.: 360.0   1st Qu.: 1.010  
##  Mode  :character   2: 7998         Median : 600.0   Median : 1.720  
##                     3: 2451         Mean   : 713.8   Mean   : 2.804  
##                     4: 1199         3rd Qu.: 910.0   3rd Qu.: 3.100  
##                     5: 3578         Max.   :8461.0   Max.   :69.560  
##                     6: 2303                                          
##  pickup.longitude pickup.latitude dropoff.longitude dropoff.latitude
##  Min.   :-74.54   Min.   : 0.00   Min.   :-74.83    Min.   : 0.00   
##  1st Qu.:-73.99   1st Qu.:40.74   1st Qu.:-73.99    1st Qu.:40.73   
##  Median :-73.98   Median :40.75   Median :-73.98    Median :40.75   
##  Mean   :-72.63   Mean   :40.01   Mean   :-72.57    Mean   :39.99   
##  3rd Qu.:-73.97   3rd Qu.:40.77   3rd Qu.:-73.96    3rd Qu.:40.77   
##  Max.   :  0.00   Max.   :45.60   Max.   :  0.00    Max.   :73.98   
##                                                                     
##  payment.type  fare.amount       Surcharge         mta.tax      
##  CRD:32063    Min.   :  2.50   Min.   :0.0000   Min.   :0.0000  
##  CSH:27629    1st Qu.:  6.50   1st Qu.:0.0000   1st Qu.:0.5000  
##  DIS:   46    Median :  9.00   Median :0.0000   Median :0.5000  
##  NOC:  117    Mean   : 11.97   Mean   :0.3201   Mean   :0.4983  
##  UNK:   39    3rd Qu.: 13.50   3rd Qu.:0.5000   3rd Qu.:0.5000  
##               Max.   :356.00   Max.   :1.5000   Max.   :0.5000  
##                                                                 
##    tip.amount       tolls.amount      total.amount      Year      
##  Min.   :  0.000   Min.   : 0.0000   Min.   :  3.00   2013:59894  
##  1st Qu.:  0.000   1st Qu.: 0.0000   1st Qu.:  8.00               
##  Median :  1.000   Median : 0.0000   Median : 10.80               
##  Mean   :  1.157   Mean   : 0.2238   Mean   : 14.33               
##  3rd Qu.:  2.000   3rd Qu.: 0.0000   3rd Qu.: 16.00               
##  Max.   :110.000   Max.   :18.5000   Max.   :356.00               
##                                                                   
##  Month          Day             Hour        Temperature     
##  1:14826   23     : 2232   19     : 3802   Min.   :-11.700  
##  2:13941   16     : 2146   18     : 3670   1st Qu.:  0.600  
##  3:15794   22     : 2114   20     : 3539   Median :  4.400  
##  4:15333   15     : 2109   21     : 3524   Mean   :  4.951  
##            2      : 2082   22     : 3355   3rd Qu.:  8.900  
##            19     : 2051   14     : 3112   Max.   : 27.800  
##            (Other):47160   (Other):38892                    
##  Precipitation      pickup.date              weekday     weekend    
##  Min.   :0.00000   Min.   :2013-01-01   Monday   :7554   no :42802  
##  1st Qu.:0.00000   1st Qu.:2013-02-01   Tuesday  :8724   yes:17092  
##  Median :0.00000   Median :2013-03-03   Wednesday:8410              
##  Mean   :0.09256   Mean   :2013-03-02   Thursday :8803              
##  3rd Qu.:0.00000   3rd Qu.:2013-04-01   Friday   :9311              
##  Max.   :8.89000   Max.   :2013-04-30   Saturday :9121              
##                                         Sunday   :7971              
##   rain      
##  no :55143  
##  yes: 4751  
##             
##             
##             
##             
## 

Most taxi trips are short: the mean trip distance is below 3 miles and the median is 1.72 miles. The mean passenger count is 1.7. The median tip is just $1, but the maximum is $110. The mean total amount is $14.33 (the mean fare alone is $11.97), and the maximum is $356.
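
Because `passenger.count` is stored as a factor, `summary()` reports a count per level rather than a mean. The following is a minimal sketch of how the figure of 1.7 can be recovered from the counts printed in the summary above; the vector name `counts` is only illustrative.

```r
# Counts per passenger.count level, copied from the summary() output above
counts <- c("1" = 42365, "2" = 7998, "3" = 2451,
            "4" = 1199,  "5" = 3578, "6" = 2303)

# Weighted mean of the level labels, using the counts as weights
mean_passengers <- weighted.mean(as.numeric(names(counts)), w = counts)
round(mean_passengers, 2)

# On the data frame itself, convert via character first, because
# as.numeric() on a factor returns the internal level codes, not the labels:
# mean(as.numeric(as.character(taxi$passenger.count)))
```

The conversion through `as.character()` matters here: the levels start at "0", so the internal codes are shifted by one relative to the labels.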

Simple plots

The number of taxi rides per day in the sample mostly lies between 400 and 600. Monday has the lowest number of rides, with a median of 450. The highest medians are on Friday and Saturday, with a higher variance on Saturday. There are not many taxi rides when it is raining: only 4,751 of the 59,894 rides in the sample (about 8%) are flagged as rain.
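
The code behind this plot is not shown, so the following is only a sketch of how rides per day and their median per weekday could be computed. It runs on a simulated data frame `toy` (an assumption for illustration); with the real data, `taxi$pickup.date` would take the place of `toy$pickup.date`.

```r
# Toy stand-in for the taxi data: one row per trip with its pickup date
set.seed(1)
days <- seq(as.Date("2013-01-01"), as.Date("2013-01-28"), by = "day")
toy  <- data.frame(pickup.date = sample(days, 2000, replace = TRUE))

# Number of rides per calendar day
per_day <- as.data.frame(table(pickup.date = toy$pickup.date),
                         responseName = "rides", stringsAsFactors = FALSE)
per_day$pickup.date <- as.Date(per_day$pickup.date)

# Weekday from the ISO day number (1 = Monday), which avoids
# locale-dependent day names from weekdays()
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
per_day$weekday <- factor(day_names[as.integer(format(per_day$pickup.date, "%u"))],
                          levels = day_names)

# Median rides per weekday, the quantity read off a boxplot
median_rides <- tapply(per_day$rides, per_day$weekday, median)
median_rides

# boxplot(rides ~ weekday, data = per_day)
```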

Looking at fares

The distribution of fares over days shows that most trips are below $25 (the mean total amount is $14.33). But a line just above $50 stands out and should be investigated in more detail.

Among trips with a fare above $50, a fare of exactly $52 is by far the most frequent (984 trips).

# subset of taxi trips with fare > $50
taxi50 = taxi[taxi$fare.amount > 50,]
table(taxi50$fare.amount)
## 
##   50.5     51   51.5     52   52.5     53   53.5     54   54.5     55 
##      6      8      7    984      8      7      6      8      5      8 
##   55.5     56   56.5     57   57.3   57.5     58   58.5     59   59.5 
##      2      7      7      1      1      6      9      3      4      3 
##     60   60.5     61   61.5     62   62.5     63   63.5     64   64.5 
##     18      7      3      4      5      4      3      2      9      5 
##     65   65.5     66   66.5     67   67.5     68   68.5     69   69.5 
##      7      2      4      1      4      4      5      4      4      4 
##     70   70.5     71     72   72.5     73     74   74.5     75     76 
##      8      2      2      4      2      2      3      1      2      1 
##   76.5     77   77.5     79     80   80.5     81     84   84.5     85 
##      2      1      1      1      7      1      3      1      1      3 
##   85.5  86.01     88     90   93.5     97     98    100    102  102.5 
##      1      1      1      1      1      1      1      5      2      1 
##  106.5    110    112    115  118.5    120 122.22    123    125    130 
##      1      1      1      1      2      3      1      1      1      2 
##    140    163    178    200    202    204    250    268    356 
##      1      1      1      1      1      1      1      1      1

The high frequency of this particular fare suggests a fixed-price offering for trips to or from the airport. To verify this, the geographic coordinates are used to look at the start and end points of the trips.

## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=40.7,-73.9&zoom=11&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false

The maps show that the special-price trips either start at the airport and end in Manhattan, or start in Manhattan and end at the airport.
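
The same check can also be done numerically. The sketch below flags trips whose pickup or dropoff point falls inside a rough bounding box around JFK airport (about 40.64 N, 73.78 W). The box limits, the helper `near_jfk()`, and the data frame `toy` are assumptions made for illustration and are not taken from the original analysis.

```r
# Rough bounding box around JFK airport.
# The limits are illustrative assumptions, not values from the analysis.
near_jfk <- function(lon, lat) {
  lon > -73.83 & lon < -73.74 & lat > 40.62 & lat < 40.67
}

# Toy trips: Midtown -> JFK, JFK -> Midtown, Midtown -> Midtown, invalid (0, 0)
toy <- data.frame(
  pickup.longitude  = c(-73.98, -73.78, -73.98, 0),
  pickup.latitude   = c( 40.75,  40.64,  40.75, 0),
  dropoff.longitude = c(-73.78, -73.99, -73.97, 0),
  dropoff.latitude  = c( 40.64,  40.75,  40.76, 0)
)

# Logical flags for the direction of each trip
toy$to.airport   <- near_jfk(toy$dropoff.longitude, toy$dropoff.latitude)
toy$from.airport <- near_jfk(toy$pickup.longitude,  toy$pickup.latitude)
toy
```

A side effect of the box is that records with coordinates of 0, which appear in the summary of the data, are never flagged as airport trips. A subset such as `taxi50.to.ap`, whose construction is not shown, could plausibly be built from a flag like `to.airport`.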

Look at trips to Airport in more detail

Arriving at the airport on time is important. Therefore we look at the time it takes to ride from Manhattan to the airport.

The trip time to the airport depends on the hour of the day. Around 4pm it takes much longer than around 5am. The trip times on the weekend are mostly below the fitted line, showing that it is easier to drive to the airport on weekends.

Trying to predict the trip time to the airport

Because it is important to know how long the drive to the airport will take, a prediction model is created. The model does not predict well (R-squared below 0.35). The likely reasons are the small data set (354 trips) and the large variance of the trip times at some hours of the day. The graph below shows the distribution of trip times to the airport, the predicted trip times (in blue), and the upper limit of the 99% confidence interval (in red).

lm1 = lm(trip.time ~ Hour, data=taxi50.to.ap)
summary(lm1)
## 
## Call:
## lm(formula = trip.time ~ Hour, data = taxi50.to.ap)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1525.2  -404.7   -81.1   243.5  5928.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2249.50     375.63   5.989 5.51e-09 ***
## Hour1        -579.50     650.60  -0.891   0.3737    
## Hour3        -697.75     531.22  -1.313   0.1899    
## Hour4        -621.57     422.75  -1.470   0.1424    
## Hour5        -670.54     406.98  -1.648   0.1004    
## Hour6        -513.42     403.49  -1.272   0.2041    
## Hour7        -407.68     438.64  -0.929   0.3533    
## Hour8          57.83     451.45   0.128   0.8981    
## Hour9          27.21     470.87   0.058   0.9539    
## Hour10       -224.60     444.45  -0.505   0.6137    
## Hour11       -274.07     425.92  -0.643   0.5204    
## Hour12        -21.75     405.72  -0.054   0.9573    
## Hour13        107.58     405.72   0.265   0.7910    
## Hour14        539.17     399.88   1.348   0.1785    
## Hour15        895.71     397.74   2.252   0.0250 *  
## Hour16        722.28     402.49   1.795   0.0736 .  
## Hour17        739.74     404.56   1.828   0.0684 .  
## Hour18        533.02     406.98   1.310   0.1912    
## Hour19        282.96     429.55   0.659   0.5105    
## Hour20       -338.42     433.74  -0.780   0.4358    
## Hour21          4.25     531.22   0.008   0.9936    
## Hour22       -610.50     503.96  -1.211   0.2266    
## Hour23       -229.17     451.45  -0.508   0.6121    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 751.3 on 331 degrees of freedom
## Multiple R-squared:  0.3461, Adjusted R-squared:  0.3027 
## F-statistic: 7.965 on 22 and 331 DF,  p-value: < 2.2e-16
# build a test data set to predict (5am to 8pm)
# note: lm1 uses only Hour as a predictor, so the weekend column does not affect the prediction
test=data.frame(Hour=factor(c(5:20),levels=c(5:20)), weekend="no")

# do prediction on test data
pred = predict(lm1, test, interval = c("confidence"), level = 0.99)
pred = data.frame(pred)
test$fit = pred$fit
test$upr = pred$upr

# show predicted data and upper limit for CI
test
##    Hour weekend      fit      upr
## 1     5      no 1578.957 1984.792
## 2     6      no 1736.077 2117.781
## 3     7      no 1841.818 2428.655
## 4     8      no 2307.333 2956.106
## 5     9      no 2276.714 3012.353
## 6    10      no 2024.900 2640.380
## 7    11      no 1975.429 2495.604
## 8    12      no 2227.750 2625.041
## 9    13      no 2357.083 2754.374
## 10   14      no 2788.667 3144.014
## 11   15      no 3145.212 3484.023
## 12   16      no 2971.778 3346.347
## 13   17      no 2989.240 3378.504
## 14   18      no 2782.522 3188.357
## 15   19      no 2532.462 3072.273
## 16   20      no 1911.083 2472.937
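
The interval computed above is a confidence interval, which bounds the mean trip time at each hour. For a traveler planning one particular ride, a prediction interval, which bounds a single future trip, may be the more relevant quantity. The following is a minimal sketch on simulated data; the data frame `toy` and its numbers are made up for illustration, and only the `interval` argument of `predict()` differs from the call above.

```r
# Simulated stand-in for taxi50.to.ap: trip time (seconds) at two hours of day
set.seed(42)
toy <- data.frame(Hour = factor(rep(c(5, 15), each = 30)))
toy$trip.time <- ifelse(toy$Hour == "15", 3100, 1600) + rnorm(60, sd = 700)

fit <- lm(trip.time ~ Hour, data = toy)
new <- data.frame(Hour = factor(c(5, 15), levels = c(5, 15)))

# Interval for the mean trip time at each hour
conf_int <- predict(fit, new, interval = "confidence", level = 0.99)

# Interval for a single future trip at each hour (always wider)
pred_int <- predict(fit, new, interval = "prediction", level = 0.99)

cbind(new,
      fit      = conf_int[, "fit"],
      conf.upr = conf_int[, "upr"],
      pred.upr = pred_int[, "upr"])
```

With a residual standard error as large as the one reported for `lm1`, the upper prediction limit would be expected to lie well above the confidence limit shown in the table.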

Looking at fare per mile and speed

The taxi fare per mile and the speed are distributed as expected. The mean fare per mile is $5.87 and the mean speed is 13.3 miles per hour. The histogram for the fare per mile has a long tail. Cutting off the upper 0.01% of the data gives a better overview of the distribution.

# Subselecting taxi trips with distance and trip time > 0
taxi0 = taxi[taxi$trip.distance>0 & taxi$trip.time>0,]

# calculate new variables for fare.per.mile and speed (miles per hour)
taxi0$fare.per.mile = taxi0$fare.amount/taxi0$trip.distance
taxi0$speed = taxi0$trip.distance/taxi0$trip.time*3600

# omit trips with speed too high
taxi0 = taxi0[taxi0$speed < 100,]

# distribution of new variables
summary(taxi0$fare.per.mile)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.1008   4.0450   5.0000   5.8690   6.3640 700.0000
summary(taxi0$speed)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.060   8.829  11.830  13.270  15.900  96.920
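
The step of cutting off the upper 0.01% before plotting is not shown in the code. One way to do it is with `quantile()`, sketched here on a simulated long-tailed vector; the vector and the names `cutoff` and `trimmed` are assumptions for illustration.

```r
# Simulated long-tailed stand-in for taxi0$fare.per.mile
set.seed(7)
fare.per.mile <- c(rlnorm(9990, meanlog = log(5), sdlog = 0.3),
                   runif(10, min = 100, max = 700))

# Value below which 99.99% of the observations fall
cutoff <- quantile(fare.per.mile, probs = 0.9999)

# Keep everything at or below the cut-off for plotting
trimmed <- fare.per.mile[fare.per.mile <= cutoff]

c(max.before = max(fare.per.mile), max.after = max(trimmed))

# hist(trimmed, breaks = 100)
```

Trimming only for the plot leaves the summary statistics above unchanged, since they are computed on the full `taxi0` data.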

Comparing the fare per mile for workdays and weekends shows that the fare is slightly higher on workdays. This might be due to heavier traffic compared to the weekend.

Looking at the speed per weekday, the median speed is highest on Sunday and lowest on Friday.

Comparing the speed over the hour of day shows a smaller number of taxi rides in the early morning hours and the slowest speeds in the early afternoon. The smoothed line for the mean speed visualizes how speed depends on the hour of day.

Separating the speed by weekend and workdays in a boxplot shows that night-time trips on weekends (Saturday or Sunday morning) are slower than on workdays. During the daytime, speeds are higher on weekends than on workdays.
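
The comparison read off the boxplot can also be tabulated. The sketch below computes the median speed for every combination of hour and weekend flag on a simulated data frame `toy` (an assumption for illustration); with the real data, `taxi0` would take its place.

```r
# Simulated stand-in for taxi0 with only the columns needed here
set.seed(3)
n <- 5000
toy <- data.frame(
  Hour    = factor(sample(0:23, n, replace = TRUE), levels = 0:23),
  weekend = factor(sample(c("no", "yes"), n, replace = TRUE),
                   levels = c("no", "yes")),
  speed   = rlnorm(n, meanlog = log(12), sdlog = 0.4)
)

# Median speed per hour (rows) and weekend flag (columns)
med <- tapply(toy$speed, list(Hour = toy$Hour, weekend = toy$weekend), median)
head(med)

# Hours in which the weekend median is lower than the workday median
rownames(med)[med[, "yes"] < med[, "no"]]
```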

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

Looking at the influence of rain

Before comparing taxi rides against the weather data, three plots visualize the weather data itself. The temperature is shown as points and tiles, whereas the rain is shown as tiles only. There are some “holes” where no weather data is available, because the sample contains no taxi trips at that point in time.

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

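Comparing the speed of the taxis by rain visually gives no information. Looking at the mean and median values, a small difference is visible. Taxis are slower and the fares are higher when it is raining; the summary below covers speed, trip distance, and tip amount, but not fares. The mean speed drops from 13.4 to 12.4 miles per hour, while the mean tip is nearly unchanged ($1.14 vs. $1.16). Also, the trip distance is slightly shorter when it is raining (mean 2.74 vs. 2.83 miles), suggesting that taxis may be used even for shorter distances.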

by(taxi0[,c("rain","speed","trip.distance","tip.amount")], taxi0$rain, summary)
## taxi0$rain: no
##   rain           speed       trip.distance     tip.amount    
##  no :54692   Min.   : 0.06   Min.   : 0.01   Min.   :  0.00  
##  yes:    0   1st Qu.: 8.88   1st Qu.: 1.05   1st Qu.:  0.00  
##              Median :11.91   Median : 1.76   Median :  1.00  
##              Mean   :13.35   Mean   : 2.83   Mean   :  1.14  
##              3rd Qu.:16.00   3rd Qu.: 3.14   3rd Qu.:  2.00  
##              Max.   :96.92   Max.   :69.56   Max.   :110.00  
## -------------------------------------------------------- 
## taxi0$rain: yes
##   rain          speed        trip.distance      tip.amount    
##  no :   0   Min.   : 0.600   Min.   : 0.030   Min.   : 0.000  
##  yes:4715   1st Qu.: 8.258   1st Qu.: 1.000   1st Qu.: 0.000  
##             Median :10.935   Median : 1.700   Median : 1.000  
##             Mean   :12.358   Mean   : 2.735   Mean   : 1.159  
##             3rd Qu.:14.667   3rd Qu.: 3.000   3rd Qu.: 2.000  
##             Max.   :71.111   Max.   :24.700   Max.   :20.000
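
Whether a difference of this size is more than noise can be checked with a two-sample test. This step is not part of the original analysis; the sketch below uses simulated speeds (the data frame `toy` and its parameters are assumptions) and shows only the mechanics of a Welch t-test on `speed` by `rain`.

```r
# Simulated stand-in for taxi0: many dry trips, few rainy ones
set.seed(11)
toy <- data.frame(
  rain  = factor(rep(c("no", "yes"), times = c(5000, 450))),
  speed = c(rlnorm(5000, meanlog = log(12), sdlog = 0.45),
            rlnorm(450,  meanlog = log(11), sdlog = 0.45))
)

# Mean and median speed per group
aggregate(speed ~ rain, data = toy,
          FUN = function(x) c(mean = mean(x), median = median(x)))

# Welch two-sample t-test (the default of t.test for two groups)
tt <- t.test(speed ~ rain, data = toy)
tt$estimate
tt$p.value

# For skewed data, a rank-based alternative is
# wilcox.test(speed ~ rain, data = toy)
```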

Looking at tips

The distribution of tips over trip distance shows that many people don’t tip at all. For longer distances the tips rise slightly, but on average by less than $1 per mile.

Even more interesting is that the mean tip is lower for 3 or 4 passengers in a taxi (about $1.01 and $0.97) than for 1, 2, 5, or 6 passengers ($1.11 to $1.16). I have no idea what might cause this.

## taxi0$passenger.count: 0
## NULL
## -------------------------------------------------------- 
## taxi0$passenger.count: 1
##    tip.amount     passenger.count
##  Min.   : 0.000   0:    0        
##  1st Qu.: 0.000   1:41971        
##  Median : 1.000   2:    0        
##  Mean   : 1.162   3:    0        
##  3rd Qu.: 2.000   4:    0        
##  Max.   :53.000   5:    0        
##                   6:    0        
## -------------------------------------------------------- 
## taxi0$passenger.count: 2
##    tip.amount     passenger.count
##  Min.   : 0.000   0:   0         
##  1st Qu.: 0.000   1:   0         
##  Median : 0.000   2:7952         
##  Mean   : 1.107   3:   0         
##  3rd Qu.: 2.000   4:   0         
##  Max.   :40.000   5:   0         
##                   6:   0         
## -------------------------------------------------------- 
## taxi0$passenger.count: 3
##    tip.amount     passenger.count
##  Min.   : 0.000   0:   0         
##  1st Qu.: 0.000   1:   0         
##  Median : 0.000   2:   0         
##  Mean   : 1.008   3:2440         
##  3rd Qu.: 1.000   4:   0         
##  Max.   :18.000   5:   0         
##                   6:   0         
## -------------------------------------------------------- 
## taxi0$passenger.count: 4
##    tip.amount      passenger.count
##  Min.   : 0.0000   0:   0         
##  1st Qu.: 0.0000   1:   0         
##  Median : 0.0000   2:   0         
##  Mean   : 0.9723   3:   0         
##  3rd Qu.: 1.0000   4:1191         
##  Max.   :16.0000   5:   0         
##                    6:   0         
## -------------------------------------------------------- 
## taxi0$passenger.count: 5
##    tip.amount     passenger.count
##  Min.   : 0.000   0:   0         
##  1st Qu.: 0.000   1:   0         
##  Median : 1.000   2:   0         
##  Mean   : 1.134   3:   0         
##  3rd Qu.: 2.000   4:   0         
##  Max.   :15.000   5:3560         
##                   6:   0         
## -------------------------------------------------------- 
## taxi0$passenger.count: 6
##    tip.amount      passenger.count
##  Min.   :  0.000   0:   0         
##  1st Qu.:  0.000   1:   0         
##  Median :  0.000   2:   0         
##  Mean   :  1.115   3:   0         
##  3rd Qu.:  1.000   4:   0         
##  Max.   :110.000   5:   0         
##                    6:2293
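
The output above is long because every group repeats the full factor table, and the unused level "0" produces an empty group. A more compact alternative, sketched here on a simulated data frame `toy` (an assumption for illustration), drops unused levels first and then aggregates the tip per passenger count.

```r
# Simulated stand-in for taxi0 with an unused factor level "0"
set.seed(5)
toy <- data.frame(
  passenger.count = factor(sample(1:6, 3000, replace = TRUE), levels = 0:6),
  tip.amount      = round(rexp(3000, rate = 1), 2)
)

# Drop the unused level so that no empty group shows up
toy$passenger.count <- droplevels(toy$passenger.count)

# One row per passenger count with mean, median and group size
tip_stats <- aggregate(tip.amount ~ passenger.count, data = toy,
                       FUN = function(x) c(mean = mean(x),
                                           median = median(x),
                                           n = length(x)))
tip_stats
```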

Final Plots and Summary

Plot 1

Description 1

The distribution of fares over days shows that most trips are below $25 (the mean total amount is $14.33). But a line just above $50 stands out. A special fixed-price offering for trips between the city and the airport might be the reason.

Plot 2

Description 2

The trip time to the airport depends on the hour of the day. Around 4pm it takes much longer than around 5am. The trip times on the weekend are mostly below the fitted line, showing that it is easier to drive to the airport on weekends.

Plot 3

Description 3

The distribution of tips over trip distance shows that many people don’t tip at all. For longer distances the tips rise slightly, but on average by less than $1 per mile.

Reflection

The NYC Taxi data set contains a large amount of data, with more than 10 million taxi trips per month. In this investigation just a small subset of the trip data is analyzed (about 60,000 trips over 4 months).

I started by looking at the distribution of taxi trips over dates and weekdays, but this does not give much information, as the distribution is almost even. Looking at the taxi fares revealed a special price, and it turned out that trips between Manhattan and JFK airport, in either direction, are the dominant taxi trips behind this special rate. Looking into trip times from Manhattan to JFK airport shows a strong dependency on the hour of day. In the afternoon it takes about twice as long to reach the airport as in the early morning. I created a prediction model for this, but the quality of the model was not very good on this sample. It would be interesting to create and test the model with the full data set. Another interesting insight was that the mean tip amount is just around $1. As a German, I had expected much higher tips. It is also interesting that the mean tip is lowest with 4 passengers in a taxi.

There are numerous relationships that are not investigated here. For example, the trip distance over time or the number of passengers by night and day could also be interesting. Looking at the full data set would give additional insights that are not possible from the sample. For example, one could analyze how many taxis are on the street in a given time frame, how many passengers each taxi carries, or how long taxis have to wait between customers.